Intro to STAT 331/531 + Intro to R

Wednesday, April 3

Today we will…

Introductions

Me!

Hi, I’m Dr. Rehnberg!

  • I am a transplant to the west coast – PA to MO to MI to CA.

  • My favorite things are being outside, drinking tea, and watching reality tv.

On a personal note…

I have a genetic, degenerative eye disease called Stargardt disease, which causes me to have poor vision, even with corrective lenses.

What this means for you:

  • When I am helping you on your computer, please make the font large and turn the brightness up.

  • I have difficulty recognizing faces – please be patient!

Questions?

Our Classroom Learning Assistant!

We will be joined in class by Libby.

Libby is…

  • A second-year Statistics major pursuing a Data Science minor.

  • Originally from San Ramon in the East Bay Area.

  • A golfer, dancer, and crocheter!

You!

I am looking forward to reading your introductions on Canvas Discussions!

  • Please read the intros of your classmates so you can discover who you will be learning with this quarter.

Syllabus

Intro to R

What is R?

  • R is a programming language designed originally for statistical analyses.
  • R was created by Ross Ihaka and Robert Gentleman in 1993.
    • Their names are why it’s called R.
  • R was formally released by the R Core Team in 1997.
    • This group of 20-ish volunteers are the only people who can change the base (built-in) functionality of R.

Strengths

R’s strengths are…

… handling data with lots of different types of variables.

… making nice and complex data visualizations.

… having cutting-edge statistical methods available to users.

Weaknesses

R’s weaknesses are…

… performing non-analysis programming tasks, like website creation (Python, Ruby, …).

… hyper-efficient numerical computation (MATLAB, C, …).

… being a simple tool for all audiences (SPSS, Stata, JMP, Minitab, …).

But wait!

Packages

The heart and soul of R are packages.

  • These are “extra” sets of code that add new functionality to R when installed.
  • “Official” R packages live on the Comprehensive R Archive Network, or CRAN.
  • But anyone can write and share new code in “package form”.

Packages

To install a package use:

install.packages("tibble")
  • You should have to install a package only once.

To load a package use:

library(tibble)
  • You have to load a package each time you restart R.
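
If you are not sure whether a package is already installed, you can ask R before installing. This is a minimal sketch in base R: the helper name is_installed() is made up for illustration, and the install line is commented out because it assumes an internet connection and a CRAN mirror.

```r
# Hypothetical helper: TRUE if a package is installed, FALSE otherwise.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # "stats" ships with R, so this is TRUE

# Install tibble only if it is missing, then load it.
# (Commented out: assumes an internet connection and a CRAN mirror.)
# if (!is_installed("tibble")) install.packages("tibble")
# library(tibble)
```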

Open-Source

Importantly, R is open-source.

  • There is no company that owns R, like there is for SAS or MATLAB.
    • (Python is also open-source!)
  • This means nobody can sell their R code!
    • But you can sell “helpers” like RStudio.
    • And you can keep code private within an organization or company.

This means packages are created by users like you and me!

Intro to RStudio

What is RStudio?

RStudio is an IDE (Integrated Development Environment).

  • This means it is an application that makes it easier for you to interact with R.

Directories & Scientific Reproducibility

What is a directory?

  • A directory is just a fancy name for a folder.

  • Your working directory is the folder that R “thinks” it lives in at the moment.

    • If you save things you have created, they save to your working directory by default.
getwd()
[1] "/Users/zrehnber/Documents/Teaching/Stat_331/material/lecture_slides/W1_intro_R"

Paths

  • A path describes where a certain file or directory lives.
[1] "/Users/zrehnber/Documents/Teaching/Stat_331/material/lecture_slides/W1_intro_R"

This file lives in my user files Users/

…on my account zrehnber/

…in my Documents folder …

…in a series of organized folders.
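
R can take a path apart for you. The sketch below uses the path from this slide stored as plain text; the variable names my_path and folders are made up for illustration.

```r
# The path from the slide, stored as a character string.
my_path <- "/Users/zrehnber/Documents/Teaching/Stat_331/material/lecture_slides/W1_intro_R"

basename(my_path)  # the last folder in the path
dirname(my_path)   # everything before the last folder

# Split the path into its individual folder names.
# The first piece is empty because the path starts with "/".
folders <- strsplit(my_path, "/")[[1]]
folders
```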

Manage your Class Directory

Create a directory for this class!

  • Is it in a place you can easily find it?

  • Does it have an informative name?

  • Are the files inside it well-organized?

The Beauty of R Projects

  • An R Project is basically a “flag” planted in a certain directory.

  • When you double click an .Rproj file, it:

  1. Opens RStudio

  2. Sets the working directory to be wherever the .Rproj file lives.

  3. Links to GitHub, if set up (more on that later!)

R Projects & Reproducibility

R Projects are great for reproducibility!

  • You can send anyone your folder with your .Rproj file and they will be able to run your code on their computer.

  • We will be using R Projects throughout this course.

Principles of Reproducibility

You can send your project to someone else, and they can jump in and start working right away.

  • This means:

    1. Files are organized and well-named.

    2. References to data and code work for everyone.

    3. Package dependency is clear.

    4. Code will run the same every time, even if data values change.

    5. Analysis process is well-explained and easy to read.

Setting up an R Project

Good practice

  • Organize your folders carefully, and name them meaningfully:
    • /Users/zrehnber/Stat331/lab1/ rather than Desktop/stuff/
  • Use R Projects liberally - put one in the “main” folder for each project

Bad practice

If you put something like this at the top of your .qmd file (more on Quarto later), I will set your computer on fire:

setwd("/Users/reginageorge/Desktop/R_Class/Lab_1/")
  • Setting working directory by hand = BAD!

  • That directory is specific to you!

  • When Quarto renders, the working directory is reset after each code chunk, so this setwd() does not carry over!
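
What to do instead: inside an R Project, the working directory is already the project folder, so you can describe files relative to it. A minimal sketch, assuming a hypothetical data folder and a hypothetical survey.csv file inside the project:

```r
# Build a path relative to the working directory (the project folder).
# "data" and "survey.csv" are hypothetical names used for illustration.
data_path <- file.path("data", "survey.csv")
data_path

# The same relative path works for anyone who has the project folder:
# survey <- read.csv(data_path)
```

Because the path never mentions a user name or a Desktop, it does not break when the project moves to another computer.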

R Basics

Data Types

  • A value is a basic unit of stuff that a program works with.

  • Values have types:

  1. logical / boolean: FALSE/TRUE or 0/1 values.
  2. integer: whole numbers.
  3. double / float / numeric: decimal numbers.
  4. character / string: holds text, usually enclosed in quotes.
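
You can ask R for the type of any value with typeof(). A quick sketch (the example values are arbitrary):

```r
typeof(TRUE)      # "logical"
typeof(5L)        # "integer" (the L suffix asks for an integer)
typeof(5)         # "double"  (numbers are doubles by default)
typeof(3.14)      # "double"
typeof("hello")   # "character"
```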

Variables

Variables are names that refer to values.

  • A variable is like a container that holds something - when you refer to the container, you get whatever is stored inside.

  • We assign values to variables using the syntax object_name <- value.

    • You can read this as “object name gets value” in your head.
message <- "So long and thanks for all the fish"
year <- 2025
the_answer <- 42
earth_demolished <- FALSE
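
Once a value is stored, you can use the variable anywhere you would use the value. A small sketch; next_year is a made-up name for illustration:

```r
the_answer <- 42
the_answer * 2          # uses the stored value: 84

year <- 2025
next_year <- year + 1   # store a result in a new variable
next_year               # 2026

year <- 2030            # assigning again replaces the old value
next_year               # still 2026; it does not update automatically
```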

Data Structures

Homogeneous: every element has the same data type.

  • Vector: a one-dimensional column of homogeneous data.

  • Matrix: the next step after a vector - it’s a set of homogeneous data arranged in a two-dimensional, rectangular format.

Heterogeneous: the elements can be of different types.

  • List: a one-dimensional column of heterogeneous data.

  • Dataframe: a two-dimensional set of heterogeneous data arranged in a rectangular format.
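
Here is one way to create each of the four structures in base R. The object names and values are made up for illustration.

```r
# Vector: one dimension, one type.
vec <- c(2.5, 3.1, 4.8)

# Matrix: two dimensions, one type.
mat <- matrix(1:6, nrow = 2, ncol = 3)

# List: one dimension, mixed types.
li <- list(42, "tea", TRUE)

# Data frame: two dimensions, columns can differ in type.
df <- data.frame(name = c("Ann", "Bo"), age = c(31, 27))

# Mixing types in a vector forces everything to a single type.
mixed <- c(1, "a", TRUE)
typeof(mixed)   # "character"
```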

Indexing

We use square brackets ([]) to access elements within data structures.

  • In R, we start indexing from 1.

Vector:

vec[4]    # 4th element
vec[1:3]  # first 3 elements

Matrix:

mat[2,6]  # element in row 2, col 6
mat[,3]   # all elements in col 3

List:

li[[5]]    # 5th element

Dataframe:

df[1,2]     # element in row 1, col 2
df[17,]     # all elements in row 17
df$colName  # all elements in the col named "colName"
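
To try these yourself, you need objects to index. The sketch below builds small example objects (the values are made up) and applies the same indexing patterns.

```r
vec <- c(10, 20, 30, 40, 50)
vec[4]      # 40
vec[1:3]    # 10 20 30

mat <- matrix(1:12, nrow = 2)   # 2 rows, 6 columns, filled column by column
mat[2, 6]   # 12
mat[, 3]    # 5 6

li <- list("a", 2, TRUE, "d", c(1, 2, 3))
li[[5]]     # 1 2 3

df <- data.frame(id = 1:3, colName = c("x", "y", "z"))
df[1, 2]    # element in row 1, col 2
df[3, ]     # all elements in row 3
df$colName  # all elements in the col named "colName"
```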

Logic

We can combine logical statements using and, or, and not.

  • (X AND Y) requires that both X and Y are true.

  • (X OR Y) requires that at least one of X or Y is true.

  • (NOT X) is true if X is false, and false if X is true.

x <- c(TRUE, FALSE, TRUE, FALSE)
y <- c(TRUE, TRUE, FALSE, FALSE)

x & y   # AND
[1]  TRUE FALSE FALSE FALSE
x | y   # OR
[1]  TRUE  TRUE  TRUE FALSE
!x & y  # NOT X AND Y
[1] FALSE  TRUE FALSE FALSE
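
Logical vectors become especially useful inside square brackets, where they pick out the elements that are TRUE. A small sketch with made-up ages:

```r
ages <- c(15, 22, 37, 68, 41)

is_adult    <- ages >= 18
is_under_65 <- ages < 65

is_adult & is_under_65          # FALSE TRUE TRUE FALSE TRUE
ages[is_adult & is_under_65]    # 22 37 41
ages[!is_adult | ages > 65]     # 15 68
```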

Troubleshooting Errors!

Messages + Warnings + Errors

Messages

Just because you see scary red text, this does not mean something went wrong! This is just R communicating with you.

  • For example, you will often see:
library(lme4)
Loading required package: Matrix

Warnings

Often, R will give you a warning.

  • This means that your code did run…

  • …but you probably want to make sure it succeeded.

Does this look right?

my_vec <- c("a", "b", "c")

my_new_vec <- as.integer(my_vec)
Warning: NAs introduced by coercion
my_new_vec
[1] NA NA NA
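
After a warning like this, it helps to check the result yourself. The sketch below is one way to find which values failed to convert; raw_values and converted are made-up names.

```r
raw_values <- c("1", "2", "c")

# suppressWarnings() hides the coercion warning so we can inspect the result ourselves.
converted <- suppressWarnings(as.integer(raw_values))
converted                      # 1 2 NA

anyNA(converted)               # TRUE: at least one value failed to convert
raw_values[is.na(converted)]   # "c": the value that caused the problem
```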

Errors

If the word Error appears in your message from R, then you have a problem.

  • This means your code could not run!
my_vec <- c("a", "b", "c")

my_new_vec <- my_vec + 1
Error in my_vec + 1: non-numeric argument to binary operator
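
You will not need this right away, but R does let you catch an error and decide what to do next. A minimal sketch using base R's tryCatch(); returning NA on failure is just one possible choice.

```r
my_vec <- c("a", "b", "c")

# tryCatch() runs the code and, if an error occurs, runs the error handler instead.
result <- tryCatch(
  my_vec + 1,
  error = function(e) {
    message("Caught an error: ", conditionMessage(e))
    NA   # value to use when the code fails
  }
)
result   # NA
```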

Syntax Errors

Did you leave off a parenthesis?

seq(from = 1, to = 10, by = 1
Error: <text>:2:0: unexpected end of input
1: seq(from = 1, to = 10, by = 1
   ^
seq(from = 1, to = 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10

Did you leave off a comma?

seq(from = 1, to = 10 by = 1)
Error: <text>:1:23: unexpected symbol
1: seq(from = 1, to = 10 by
                          ^
seq(from = 1, to = 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10

Did you make a typo? Are you using the right names?

sequence(from = 1, to = 10, by = 1)
Error in sequence.default(from = 1, to = 10, by = 1): argument "nvec" is missing, with no default
seq(from = 1, to = 10, by = 1)
 [1]  1  2  3  4  5  6  7  8  9 10

Object Type Errors

Are you using the right input that the function expects?

sqrt('1')
Error in sqrt("1"): non-numeric argument to mathematical function
sqrt(1)
[1] 1

Are you expecting the right output of the function?

my_obj <- seq(from = 1, to = 10, by = 1)

my_obj(5)
Error in my_obj(5): could not find function "my_obj"
my_obj[5]
[1] 5
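
When you are unsure what kind of object you have, ask R before using it. A short sketch with base R checks:

```r
my_obj <- seq(from = 1, to = 10, by = 1)

class(my_obj)         # "numeric"
is.function(my_obj)   # FALSE, so my_obj(5) cannot work
is.numeric(my_obj)    # TRUE, so my_obj[5] is the right tool

class(sqrt)           # "function"
is.numeric("1")       # FALSE, which is why sqrt("1") fails
```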

Common Errors

R says…

Error: Object some_obj not found.

It probably means…

You haven’t run the code to create some_obj OR you have a typo in the name!

some_ojb <- 1:10

mean(some_obj)
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'mean': object 'some_obj' not found

R says…

Error: Object of type ‘closure’ is not subsettable.

It probably means…

Oops, you tried to use square brackets on a function

mean[1, 2]
Error in mean[1, 2]: object of type 'closure' is not subsettable

R says…

Error: Non-numeric argument to binary operator.

It probably means…

You tried to do math on data that isn’t numeric.

"a" + 2
Error in "a" + 2: non-numeric argument to binary operator

What if none of these solved my error?

  1. Look at the help file for the function!

  2. When all else fails, Google your error message.

    • Leave out the specifics.

    • Include the function you are using.
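
Here is one way to look things up from the console. The sketch uses sd() as an arbitrary example function; the help commands are commented out because they open a help page and are meant to be run interactively.

```r
# Open the help file for a function (run these interactively in the console):
# ?sd
# help("sd")

# Check the argument names a function expects without leaving the console.
args(sd)             # function (x, na.rm = FALSE)
names(formals(sd))   # "x" "na.rm"
```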

Try it…

What’s wrong here?

matrix(c("a", "b", "c", "d"), num_row = 2)
Error in matrix(c("a", "b", "c", "d"), num_row = 2): unused argument (num_row = 2)

Scripts + Notebooks

Scripts

  • Scripts (File > New File > R Script) are files of code that are meant to be run on their own.
  • Scripts can be run in RStudio by clicking the Run button at the top of the editor window when the script is open.

  • You can also run code interactively in a script by:

    • highlighting lines of code and hitting run.

    • placing your cursor on a line of code and hitting run.

    • placing your cursor on a line of code and hitting Ctrl + Enter (Windows/Linux) or Cmd + Enter (Mac).

Notebooks

Notebooks are an implementation of literate programming.

  • They allow you to integrate code, output, text, images, etc. into a single document.

  • E.g.,

    • R Markdown notebook
    • Quarto notebook
    • Jupyter notebook

Reproducibility!

What is Markdown?

Markdown (without the “R”) is a markup language.

  • It uses special symbols and formatting to make pretty documents.

  • Markdown files have the .md extension.

What is Quarto?

Quarto uses Markdown, AND it can run and display R code.

  • (Other languages, too!)
  • Quarto files have the .qmd extension.

Highlights of Quarto

  • Consistent implementation of attractive and handy features across outputs:

    • E.g., tabsets, code-folding, syntax highlighting, etc.
  • More accessible defaults and better support for accessibility.

  • Guardrails that are helpful when learning:

    • E.g., YAML completion, informative syntax errors, etc.
  • Support for other languages like Python, Julia, Observable, and more.

Quarto Formats

Quarto makes moving between outputs straightforward.

  • All that needs to change between these formats is a few lines in the front matter (YAML)!

Document

title: "Lesson 1"
format: html

Presentation

title: "Lesson 1"
format: revealjs

Website

project:
  type: website

website: 
  navbar: 
    left:
      - lesson-1.qmd

Quarto Components

Markdown in Quarto

A few useful tips for formatting the Markdown text in your document:

  • *text* – makes italics
  • **text** – makes bold text
  • # – makes headers
  • ![ ]( ) – includes images; [ ]( ) without the leading ! makes a link
  • < > – embeds URLs

R Code Options in Quarto

R code chunk options are included at the top of each code chunk, prefaced with a #| (hashpipe).

  • These options control how the following code is run and reported in the final Quarto document.
  • R code options can also be included in the front matter (YAML) and are applied globally to the document.
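
As a sketch, the contents of one R code chunk might look like the lines below (they go inside an {r} chunk in your .qmd file). label, echo, and warning are standard chunk options; the label name and the scores are made up for illustration.

```r
#| label: mean-score
#| echo: false
#| warning: false

scores <- c(82, 91, 77)
mean_score <- mean(scores)
mean_score
```

With echo: false, the rendered document shows the output but hides the code; with warning: false, any warnings are left out of the document.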

Rendering your Quarto Document

To take your .qmd file and make it look pretty, you have to render it.

Rendering your Quarto Document

Quarto CLI (command line interface) orchestrates each step of rendering:

  1. Process the executable code chunks with either knitr or Jupyter.
  2. Convert the resulting Markdown file to the desired output.

Rendering your Quarto Document

When you click Render:

  • Your file is saved.
  • The R code written in your .qmd file gets run in order.
    • It starts from scratch, even if you previously ran some of the code.
  • A new file is created.
    • If your Quarto file is called “Lab1.qmd”, then a file called “Lab1.html” will be created.
    • This will be saved in the same folder as “Lab1.qmd”.

PA 1: Find the Mistakes

The components of the Practice Activity are described below:

Part One:

This file has many mistakes in the code. Some are errors that will prevent the file from rendering; some are mistakes that do NOT result in an error.

Fix all the problems in the code chunks.

Part Two:

Follow the instructions in the file to uncover a secret message.

Submit the name of the poem as the answer to the Canvas quiz question.

Lab 1: Introduction to Quarto

To do…

  • PA 1: Find the Mistakes
    • Due Saturday (4/6) at 11:59pm
  • Lab 1: Introduction to Quarto
    • Due Saturday (4/6) at 11:59pm
  • Read Chapter 2: Importing Data + Basics of Graphics
    • Check-in 2.1 + 2.2 due Monday (4/8) at 10:00am